Abstract

A fundamental question in biology is the following: what is the time scale that is needed for evolutionary innovations? 
There are many results that characterize single steps in terms of the fixation time of new mutants arising in populations of 
certain size and structure. But here we ask a different question, which is concerned with the much longer time scale of 
evolutionary trajectories: how long does it take for a population exploring a fitness landscape to find target sequences that 
encode new biological functions? Our key variable is the length, L, of the genetic sequence that undergoes adaptation. In 
computer science there is a crucial distinction between problems that require algorithms which take polynomial or 
exponential time. The latter are considered to be intractable. Here we develop a theoretical approach that allows us to 
estimate the time of evolution as a function of L. We show that adaptation on many fitness landscapes takes time that is 
exponential in L, even if there are broad selection gradients and many targets uniformly distributed in sequence space. 
These negative results lead us to search for specific mechanisms that allow evolution to work on polynomial time scales. We 
study a regeneration process and show that it enables evolution to work in polynomial time. 
Introduction

Our planet came into existence 4.6 billion years ago. There is 
clear chemical evidence for life on earth 3.5 billion years ago [1,2]. 
The evolutionary process generated prokaryotes, eukaryotes and 
complex multi-cellular organisms. Throughout the history of life, 
evolution had to discover sequences of biological polymers that 
perform specific, complicated functions. The average length of 
bacterial genes is about 1000 nucleotides, that of human genes 
about 3000 nucleotides. The longest known bacterial gene 
contains more than $10^5$ nucleotides, the longest human gene 
more than $10^6$. A basic question is what is the time scale required 
by evolution to discover the sequences that perform desired 
functions. While many results exist for the fixation time of 
individual mutants [3-15], here we ask how the time scale of 
evolution depends on the length L of the sequence that needs to be 
adapted. We consider the crucial distinction of polynomial versus 
exponential time [16-18]. A time scale that grows exponentially in 
L is infeasible for long sequences. 

Evolutionary dynamics operates in sequence space, which can 
be imagined as a discrete multi-dimensional lattice that arises 
when all sequences of a given length are arranged such that 
nearest neighbors differ by one point mutation [19]. For constant 
selection, each point in sequence space is associated with a non- 
negative fitness value (reproductive rate). The resulting fitness 
landscape is a high dimensional mountain range. Populations 
explore fitness landscapes searching for elevated regions, ridges, 
and peaks [20-27]. 

A question that has been extensively studied is how long it 
takes for existing biological functions to improve under natural 
selection. This problem leads to the study of adaptive walks on 
fitness landscapes [15,20,21,28,29]. In this paper we ask a different 
question: how long does it take for evolution to discover a new 
function? More specifically, our aim is to estimate the expected 
discovery time of new biological functions: how long does it take 
for a population of reproducing organisms to discover a biological 
function that is not present at the beginning of the search. We will 
discuss two approximations for rugged fitness landscapes. We also 
discuss the significance of clustered peaks. 

We consider an alphabet of size four, as is the case for DNA and 
RNA, and a nucleotide sequence of length L. We consider a 
population of size N, which reproduces asexually. The mutation 
rate, u, is small: individual mutations are introduced and evaluated 
by natural selection and random drift one at a time. The 
probability that the evolutionary process moves from a sequence $i$ 
to a sequence $j$, which is at Hamming distance one from $i$, is given 
by $P_{ij} = [Nu/(3L)]\rho_{ij}$, where $\rho_{ij}$ is the fixation probability of 
sequence $j$ in a population consisting of sequence $i$. In the special 
case of a flat fitness landscape, we have $\rho_{ij} = 1/N$, and 
$P_{ij} = u/(3L)$. Thus we have an evolutionary random walk, 
where each step is a jump to a neighboring sequence of Hamming 
distance one. 
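
The random walk on a flat landscape can be sketched in a few lines of Python. This is only a minimal illustration, assuming that in each generation a substitution fixes with probability u and that each of the 3L neighbors is then equally likely (so each neighbor is reached with probability u/(3L)). The names `neutral_step` and `hamming`, the sequence length, and the parameter values are illustrative choices, not taken from the paper.

```python
import random

ALPHABET = "ACGT"


def neutral_step(seq, u, rng):
    """One generation of the neutral walk on a flat landscape.

    With probability u a single point substitution fixes; the position and
    the new nucleotide are chosen uniformly, so each of the 3L neighbours
    is reached with probability u / (3L).
    """
    if rng.random() >= u:
        return seq  # no substitution in this generation
    pos = rng.randrange(len(seq))
    new_base = rng.choice([b for b in ALPHABET if b != seq[pos]])
    return seq[:pos] + new_base + seq[pos + 1:]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    start = "A" * 20
    current = start
    for _ in range(1000):
        current = neutral_step(current, u=0.1, rng=rng)
    print("Hamming distance from start after 1000 generations:",
          hamming(start, current))
```

Every step changes the sequence in at most one position, which is what makes the process a walk between nearest neighbors in sequence space.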

Results 

Consider a high-dimensional sequence space. A particular 
biological function can be instantiated by some of the sequences. 
Each sequence $i$ has a fitness value $f_i$, which measures the ability of 
the sequence i to encode the desired function. Biological fitness 
landscapes are typically expected to have many peaks [29-31]. 



PLOS Computational Biology | www.ploscompbiol.org | September 2014 | Volume 10 | Issue 9 | e1003818

The Time Scale of Evolutionary Innovation



Author Summary 

Evolutionary adaptation can be described as a biased, 
stochastic walk of a population of sequences in a high 
dimensional sequence space. The population explores a 
fitness landscape. The mutation-selection process biases 
the population towards regions of higher fitness. In this 
paper we estimate the time scale that is needed for 
evolutionary innovation. Our key parameter is the length 
of the genetic sequence that needs to be adapted. We 
show that a variety of evolutionary processes take 
exponential time in sequence length. We propose a 
specific process, which we call the 'regeneration process', 
and show that it allows evolution to work on polynomial 
time scales. In this view, evolution can solve a problem 
efficiently if it has solved a similar problem already. 

Fitness landscapes can be highly rugged due to epistatic effects of mutations 
[32-34]. They can also contain large regions or networks of 
neutrality [20,21]. Empirical studies of short RNA sequences have 
revealed that the underlying fitness landscape has low peak density 
[35]: around 15 peaks in $4^{24}$ sequences. 

For the purpose of estimating the expected discovery time we 
can approximate the fitness landscape with a binary step function 
over the sequence space. We discuss two different approximations 
(Figure 1). For the first approximation, we consider the scenario 
where fitness values below some threshold, $f_{\min}$, have negligible 
contribution; those sequences do not instantiate the desired 
function (either not at all or only below the minimum level that 
could be detected by natural selection). We approximate the 
rugged fitness landscape as follows: if $f_i < f_{\min}$ then $f_i = 0$; if 
$f_i \geq f_{\min}$ then $f_i = 1$. The set of sequences with $f_i \geq f_{\min}$ constitutes 
the target set, and the remaining fitness landscape is neutral. 
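
The first approximation amounts to thresholding the landscape. The sketch below is a hedged illustration only: the toy fitness values and the names `binarize` and `target_set` are hypothetical, and sequences exactly at the threshold are treated as targets, following the "otherwise" wording of the Figure 1 caption.

```python
def binarize(fitness, f_min):
    """Map a rugged landscape to a step function: 0 below f_min, 1 otherwise."""
    return {seq: (1 if f >= f_min else 0) for seq, f in fitness.items()}


def target_set(fitness, f_min):
    """Sequences that count as targets under the first approximation."""
    return {seq for seq, f in fitness.items() if f >= f_min}


if __name__ == "__main__":
    # Hypothetical toy landscape over a handful of length-3 sequences.
    toy = {"AAA": 0.02, "AAC": 0.40, "ACC": 0.75, "CCC": 0.10, "GCC": 0.55}
    print(binarize(toy, f_min=0.5))
    print(sorted(target_set(toy, f_min=0.5)))
```

Everything outside the returned target set has fitness 0 in the binarized landscape, so the search there is a neutral walk.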

The second approximation works as follows. Consider the 
evolutionary process exploring a rugged fitness landscape where 
the goal is to attain a fitness level $f^*$. Local maxima below $f^*$ slow 
down the evolutionary process to attain $f^*$, because the 
evolutionary walk might get stuck in those local maxima. In order 
to derive lower bounds for the expected discovery time, the rugged 
fitness landscape can be approximated as follows. Let $f$ be the 
fitness value of the highest local maximum below $f^*$. Then for 
every sequence in a mountain range with a local maximum below 
$f^*$ we assign the fitness value $f$. The mountain ranges with local 
maxima above $f^*$ are the target sequences. Note that the target set 
includes sequences that start at the upslope of mountain ranges 
with peaks above $f^*$. Thus, again we obtain a fitness landscape 
with clustered targets and neutral region, where the neutral region 
consists of all sequences whose fitness values have been assigned to 
$f$. The two approximations are illustrated in Figure 1. For 
$f^* = f_{\min}$ the second approximation generates larger target areas 
than the first approximation and is therefore more lenient. 

Our key results for estimating the discovery time can now be 
formulated for binary fitness landscapes, but they apply to any 
type of rugged landscape using one of the two approximations. We 
note that our methods can also be applied for certain non-binary 
fitness landscapes, and an example of a fitness landscape with a 
large gradient arising from multiplicative fitness effects is discussed 
in Sections 6 and 7 of Text S1. 

We now present our main results in the following order. We first 
estimate the discovery time of a single search aiming to find a 
single broad peak. Then we study multiple simultaneous searches 
for a single broad peak. Finally, we consider multiple broad peaks 
that are uniformly randomly distributed in sequence space. 



We first study a broad peak of target sequences described as 
follows: consider a specific sequence; any sequence within a certain 
Hamming distance of that sequence belongs to the target set. 
Specifically, we consider that the evolutionary process has 
succeeded, if the population discovers a sequence that differs 
from the specific sequence in no more than a fraction c of 
positions. We refer to the specific sequence as the target center and 
c as the width (or radius) of the peak. For example, if L = 100 and 
c = 0.1, then the target center is surrounded by a cloud of 
approximately $10^{18}$ sequences. For a single broad peak with width 
$c$, the target set contains at least $2^{cL}/(3L)$ sequences, which is an 
exponential function of L. The fitness landscape outside the broad 
peak is flat. We refer to this binary fitness landscape as a broad peak 
landscape. The population needs to discover any one of the target 
sequences in the broad peak, starting from some sequence that is 
not in the broad peak. We establish the following result. 
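
The size of a broad peak can be checked by direct counting. The sketch below assumes the target set is the Hamming ball of radius cL around the target center, so that it contains the sum over k ≤ cL of C(L, k)·3^k sequences; the function name `broad_peak_size` and the integer `radius` argument are illustrative choices.

```python
from math import comb


def broad_peak_size(L, radius):
    """Number of length-L sequences over a 4-letter alphabet that differ from
    a fixed target center in at most `radius` positions."""
    return sum(comb(L, k) * 3**k for k in range(radius + 1))


if __name__ == "__main__":
    L, radius = 100, 10          # c = 0.1
    size = broad_peak_size(L, radius)
    print(f"target set size: {size:.3e}")
    print(f"lower bound 2^(cL)/(3L): {2**radius / (3 * L):.3e}")
    print(f"fraction of sequence space: {size / 4**L:.3e}")
```

Even though the target set is astronomically large, it remains a vanishing fraction of the $4^L$ sequences, which is why size alone does not make the peak easy to find.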

Theorem 1. Consider a single search exploring a broad peak 
landscape with width c and mutation rate u. The following 
assertions hold: 

• if $c < 3/4$, then there exists $L_0 \in \mathbb{N}$ such that for all sequence 
spaces of sequence length $L > L_0$, the expected discovery time is 
at least $\exp\left[(3-4c)\,\frac{L}{16}\,\log\frac{6}{4c+3}\right]$; 

• if $c \geq 3/4$, then for all sequence spaces of sequence length L, the 
expected discovery time is at most $O(L^2/u)$. 

Our result can be interpreted as follows (see Theorem S2 and 
Corollary S2 in Text S1): (i) If $c < 3/4$, then the expected discovery 
time is exponential in L; and (ii) if $c \geq 3/4$, then the expected 
discovery time is polynomial in L. Thus, we have derived a strong 
dichotomy result which shows a sharp transition from polynomial 
to exponential time depending on whether a specific condition on 
c does or does not hold. 

For the four letter alphabet most random sequences have 
Hamming distance 3L/4 from the target center. If the population 
is further away than this Hamming distance, then random drift 
will bring it closer. If the population is closer than this Hamming 
distance, then random drift will push it further away. This 
argument constitutes the intuitive reason that c = 3/4 is the critical 
threshold. If the peak has a width of less than c = 3/4, then we 
prove that the expected discovery time by random drift is 
exponential in the sequence length L (see Figure 2). This result 
holds for any population size, $N$, as long as $4^L \gg N$, which is 
certainly the case for realistic values of L and N. In Text S1 we 
also present a more general result, where, instead of a flat 
landscape outside the single broad peak, we consider a 
multiplicative fitness landscape and establish a sharp dichotomy 
result that generalizes Theorem 1 (see Corollary S2 in Text S1). 
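
The intuition about the threshold 3L/4 can be made concrete with a short calculation. The sketch assumes a single uniformly chosen point substitution on a flat landscape: from Hamming distance n, the walk moves closer with probability n/(3L) (a mismatched position mutates to the correct letter) and farther with probability (L - n)/L (a matching position mutates). The function names are illustrative.

```python
from fractions import Fraction


def move_probabilities(n, L):
    """Probabilities that one random point substitution moves a sequence at
    Hamming distance n one step closer to, or farther from, the target center."""
    closer = Fraction(n, 3 * L)
    farther = Fraction(L - n, L)
    return closer, farther


def expected_drift(n, L):
    """Expected change in Hamming distance per substitution: 1 - 4n/(3L)."""
    closer, farther = move_probabilities(n, L)
    return farther - closer


if __name__ == "__main__":
    L = 100
    for n in (25, 50, 75, 90, 100):
        print(n, float(expected_drift(n, L)))
```

The drift is positive (away from the center) below 3L/4, zero exactly at 3L/4, and negative above it, so a peak narrower than 3L/4 must be reached against the drift.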

Remark 1. We highlight two important aspects of our results. 

1. First, when we establish exponential lower bounds for the 
expected discovery time, then these lower bounds hold even if the 
starting sequence is only a few steps away from the target set. 

2. Second, we present strong dichotomy results, and derive 
mathematically the most precise and strongest form of the 
boundary condition. 

Figure 1. Approximations of a highly rugged fitness landscape by broad peaks and neutral regions. The figures depict examples of 
highly rugged fitness landscapes where the sequence space has been projected in one dimension. (A) Sequences with fitness below some level $f_{\min}$ 
are functionally very different from the desired function, and selection cannot act upon them. All other sequences are considered as targets. The fitness 
landscape is approximated by a step function: if $f_i < f_{\min}$, then $f_i = 0$, otherwise $f_i = 1$. (B) Local maxima below the desired fitness threshold $f^*$ are 
known to slow down the evolutionary random walk towards sequences that attain fitness at least $f^*$. We approximate the fitness landscape by broad 
peaks and neutral regions by increasing the fitness of every sequence that belongs to a mountain range with fitness below $f^*$ to the highest local 
maximum $f$ below $f^*$. Note that the target set starts from the upslope of a mountain range whose peak exceeds $f^*$. 
doi:10.1371/journal.pcbi.1003818.g001 

Let us now give a numerical example to demonstrate that 
exponential time is intractable. Bacterial life on earth has been 
around for at least 3.5 billion years, which corresponds to $3 \times 10^{13}$ 
hours. Assuming fast bacterial cell division of 20-30 minutes on 
average we have at most $10^{14}$ generations. The expected discovery 
time for a sequence of length $L = 1000$ with a very large broad 
peak of $c = 1/2$ is approximately $10^{65}$ generations; see Table 1. 
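
The arithmetic behind these figures is easy to reproduce. The following sketch assumes a year of 365.25 days and uses the discovery time of roughly $10^{65}$ generations quoted above; the variable and function names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24       # assumed length of a year in hours
YEARS_OF_BACTERIAL_LIFE = 3.5e9

hours = YEARS_OF_BACTERIAL_LIFE * HOURS_PER_YEAR


def generations(division_minutes):
    """Number of generations that fit into `hours` for a given division time."""
    return hours * 60 / division_minutes


if __name__ == "__main__":
    print(f"hours available: {hours:.2e}")
    print(f"generations at 30 min: {generations(30):.2e}")
    print(f"generations at 20 min: {generations(20):.2e}")
    # Shortfall relative to the quoted discovery time for L = 1000, c = 1/2.
    quoted_discovery_time = 1e65
    print(f"shortfall factor: {quoted_discovery_time / generations(20):.1e}")
```

Even with the fastest division time, the available number of generations falls short of the quoted discovery time by about 51 orders of magnitude.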

If individual evolutionary processes cannot find targets in 
polynomial time, then perhaps the success of evolution is based on 
the fact that many populations are searching independently and in 
parallel for a particular adaptation. We prove that multiple, 
independent parallel searches are not the solution to the problem, 
if the starting sequence is far away from the target center. Formally 
we show the following result. 



Theorem 2. In all cases where the lower bound on the expected 
discovery time is exponential, for all polynomials $p_1(\cdot)$, $p_2(\cdot)$ and 
$p_3(\cdot)$, for any starting sequence with Hamming distance at least 
$3L/4$ from the target center, the probability for any one out of $p_3(L)$ 
independent multiple searches to reach the target set within $p_1(L)$ 
steps is at most $1/p_2(L)$. 

[Figure 2 image not reproduced. Recoverable labels: exp(L) and poly(L) regimes; axes: Hamming distance; sequence length, L.] 

Figure 2. Broad peak with different fitness landscapes. For the broad peak there is a specific sequence, and all sequences that are within 
Hamming distance $cL$ are part of the target set. The fitness landscape is flat outside the broad peak. (A) If the width of the broad peak is $c < 3/4$, then 
the expected discovery time is exponential in sequence length, L. (B) If the width of the broad peak is $c \geq 3/4$, then the expected discovery time is 
polynomial in sequence length, L. (C) Numerical calculations for broad peak fitness landscapes. We observe exponential expected discovery time for 
$c = 1/3$ and $c = 1/2$, whereas polynomial expected discovery time for $c = 3/4$. 
doi:10.1371/journal.pcbi.1003818.g002 

Table 1. Numerical data for discovery time in flat fitness landscapes. 
Numerical data for the discovery time of broad peaks with width $c = 1/3$, $1/2$, and $3/4$ embedded in flat fitness landscapes. First the discovery time is computed for 
small values of L as shown in Figure 2(C). Then the exponential growth is extrapolated to $L = 100$ and $L = 1000$, respectively. We show the discovery times for $c = 1/2$ 
and $1/3$. For $c = 3/4$ the values are polynomial in L. 
doi:10.1371/journal.pcbi.1003818.t001 

If an evolutionary process takes exponential time, then 
polynomially many independent searches do not find the target 
in polynomial time with reasonable probability (for details see 
Theorem S5 in Text S1). We also show an informal and 
approximate calculation of the success probability for M 
independent searches, as follows: if the expected discovery time 
is exponential (say, d), then the probability that all M independent 
searches fail up to b steps is at least $\exp(-Mb/d)$ (i.e., the 
success probability within b steps of any of the searches is at most 
$1 - \exp(-Mb/d)$), when the starting sequence is far away from 
the target center. In such a case, one could quickly exhaust the 
physical resources of an entire planet. The estimated number of 
bacterial cells [36] on earth is about $10^{30}$. To give a specific 
example let us assume that there are $10^{24}$ independent searches, 
each with population size $N = 10^6$. The probability that at least 
one of those independent searches succeeds within $10^{14}$ generations 
for sequence length $L = 1000$ and broad peak of $c = 1/2$ is 
less than $10^{-26}$. 
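
The informal bound can be evaluated directly. The sketch below plugs the numbers from the example into $1 - \exp(-Mb/d)$, taking d = $10^{65}$ from the extrapolated value quoted for L = 1000 and c = 1/2; the function name is an illustrative choice, and `math.expm1` is used only to avoid losing the tiny result to floating point rounding.

```python
import math


def max_success_probability(M, b, d):
    """Upper bound 1 - exp(-M*b/d) on the probability that at least one of M
    independent searches succeeds within b steps, given expected discovery
    time d for a single search (informal approximation from the text)."""
    return -math.expm1(-(M * b) / d)


if __name__ == "__main__":
    p = max_success_probability(M=1e24, b=1e14, d=1e65)
    print(f"success probability bound: {p:.1e}")
```

With these inputs the bound evaluates to about $10^{-27}$, consistent with the statement that the success probability is below $10^{-26}$.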

In our basic model, individual mutants are evaluated one at a 
time. The situation of many mutant lineages evolving in parallel is 
similar to the multiple searches described above. As we show that 
whenever a single search takes exponential time, multiple 
independent searches do not lead to polynomial time solutions, 
our results imply intractability for this case as well. 

We now explore the case of multiple broad peaks that are 
uniformly and randomly distributed. Consider that there are m 
target centers. Around each target center there is a selection 
gradient extending up to a distance cL. Formally we can consider 
any fitness function $f$ that assigns zero fitness to a sequence whose 
Hamming distance exceeds cL from all the target centers, which in 
particular is subsumed by considering the multiple broad peaks 
where around each center we consider a broad peak of target set 
with peak width c. We establish the following result: 

Theorem 3. Consider a single search under the multiple broad 
peak fitness landscape of $m \ll 4^L$ target centers chosen uniformly 
at random, with peak width at most $c$ for each center and $c < 3/4$. 
Then with high probability, the expected discovery time of the target 
set is at least $(1/m)\exp[2L(3/4 - c)^2]$. 

Whether or not the function $(1/m)\exp[2L(3/4 - c)^2]$ is 
exponential in L depends on how m changes with L. But even if 
we assume exponentially many broad peak centers, m, with peak 
width $cL$ where $c < 3/4$, we need not obtain polynomial time 
(Figure 3 and Theorem S6 in Text S1). 
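
How the bound of Theorem 3 behaves for different growth rates of m is easiest to see on a logarithmic scale. The sketch below evaluates the natural logarithm of $(1/m)\exp[2L(3/4-c)^2]$; the function name and the three example choices of m (constant, $2^{L/3}$, and $4^{L/3}$) are illustrative assumptions used only to show the different regimes.

```python
import math


def log_lower_bound(L, c, log_m):
    """Natural log of (1/m) * exp[2 L (3/4 - c)^2], with m given as log(m)."""
    return 2 * L * (0.75 - c) ** 2 - log_m


if __name__ == "__main__":
    c = 1 / 3
    for L in (150, 300, 600):
        fixed = log_lower_bound(L, c, math.log(200))           # m = 200
        slow = log_lower_bound(L, c, (L / 3) * math.log(2))    # m = 2^(L/3)
        fast = log_lower_bound(L, c, (L / 3) * math.log(4))    # m = 4^(L/3)
        print(L, round(fixed, 1), round(slow, 1), round(fast, 1))
```

For c = 1/3 the log-bound grows linearly in L both for constant m and for m = $2^{L/3}$, so the bound itself is exponential. For m = $4^{L/3}$ the log-bound becomes negative, meaning this particular bound is no longer informative, not that discovery is necessarily fast.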

It is known that recombination may accelerate evolution on 
certain fitness landscapes [28,37-39], and recombination may also 
slow down evolution on other fitness landscapes [40]. Recombi- 
nation, however, reduces the discovery time only by at most a 
linear factor in sequence length [28,37,38,41,42]. A linear or even 
polynomial factor improvement over an exponential function does 
not convert the exponential function into a polynomial one. 
Hence, recombination can make a significant difference only if the 
underlying evolutionary process without recombination already 
operates in polynomial time. 



What are then adaptive problems that can be solved by 
evolution in polynomial time? We propose a "regeneration 
process". The basic idea is that evolution can solve a new 
problem efficiently, if it has solved a similar problem already. 
Suppose gene duplication or genome rearrangement can give rise 
to starting sequences that are at most k point mutations away from 
the target set, where k is a number that is independent of L. It is 
important that starting sequences can be regenerated again and 
again. We prove that $L^{k+1}$ many searches are sufficient in order to 
find the target in polynomial time with high probability (see 
Figure 4 and Section 10 in Text S1). The upper bound, $L^{k+1}$, 
holds even for neutral drift (without selection). Note that in this 
case, the expected discovery time for any single search is still 
exponential. Therefore, most of the $L^{k+1}$ searches do not succeed 
in polynomial time; however, with high probability one of the 
searches succeeds in polynomial time. There are two key aspects to 
the "regeneration process": (a) the starting sequence is only a small 
number of steps away from the target; and (b) the starting 
sequence can be generated repeatedly. This process enables 
evolution to overcome the exponential barrier. The upper bound, 
$L^{k+1}$, may possibly be further reduced, if selection and/or 
recombination are included. 
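
A back-of-the-envelope calculation suggests why on the order of $L^{k+1}$ regenerated searches can suffice. The sketch below is a simplified illustration and not the proof from Text S1: it assumes a single target sequence exactly k substitutions away and neutral drift, and it only counts searches whose first k substitutions all move closer to the target (probability k!/(3L)^k). The function names are hypothetical.

```python
from math import exp, factorial, log1p


def direct_path_probability(L, k):
    """Probability that the first k substitutions of a neutral walk each move
    one step closer to a single target that is k steps away: the step from
    distance n succeeds with probability n / (3L)."""
    return factorial(k) / (3 * L) ** k


def all_searches_fail(L, k):
    """Probability that none of L^(k+1) independent, regenerated searches
    follows such a direct path."""
    p = direct_path_probability(L, k)
    searches = L ** (k + 1)
    return exp(searches * log1p(-p))


if __name__ == "__main__":
    for L in (50, 100, 200):
        print(L, direct_path_probability(L, 2), all_searches_fail(L, 2))
```

Under these assumptions the failure probability is roughly exp(-k!·L/3^k), which shrinks as L grows for fixed k, even though each individual search succeeds only rarely.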

Discussion 

The regeneration process formalizes the role of several existing 
ideas. First, it ties in with the proposal that gene duplications and 
genome rearrangements are major events leading to the emer- 
gence of new genes [43]. Second, evolution can be seen as a 
tinkerer playing around with small modifications of existing 
sequences rather than creating entirely new ones [44]. Third, the 
process is related to Gillespie's suggestion [29] that the starting 
sequence for an evolutionary search must have high fitness. In our 
theory, proximity in fitness value is replaced by proximity in 
sequence space. However, our results show that proximity alone is 
insufficient to break the exponential barrier; only when 
combined with the process of regeneration does it yield polynomial 
discovery time with high probability. Our process can also explain 
the emergence of orphan genes arising from non-coding regions 
[45]. Section 12 of Text S1 discusses the connection of our 
approach to existing results. 

There is one other scenario that must be mentioned. It is 
possible that certain biological functions are hyper-abundant in 
sequence space [21] and that a process generating a large number 
of random sequences will find the function with high probability. 
For example, Bartel & Szostak [46] isolated a new ribozyme from 
a pool of about 10 15 random sequences of length L = 220. While 
such a process is conceivable for small effective sequence length, it 
cannot represent a general solution for large L. 

Figure 3. The search for randomly, uniformly distributed targets in sequence space. (A) The target set consists of $m$ random sequences; 
each one of them is surrounded by a broad peak of width up to $cL$. The figure shows a pictorial illustration where the L-dimensional sequence space 
is projected onto two dimensions. From a randomly chosen starting sequence outside the target set, the expected discovery time is at least 
$(1/m)\exp[2L(3/4-c)^2]$, which can be exponential in L. (B) Computer simulations showing the average discovery time of $m = 100$, 150, and 200 
targets, with $c = 1/3$. We observe exponential dependency on L. The discovery time is averaged over 200 runs. (C) Success probability estimated as 
the fraction of the 200 searches that succeed in finding one of the target sequences within $10^4$ generations. The success probability drops 
exponentially with L. (D) Success probability as a function of time for $L = 42$, 45, and 48. (E) Discovery time for a large number of randomly generated 
target sequences. Either $m = 2^{L/3+2}$ or $m = 4^{L/3}$ sequences were generated. For $b = 0$ and $b = 3$ the target set consists of balls of Hamming distance 0 
and 3 (respectively) around each sequence. The figure shows the average discovery time of 100 runs. As expected we observe that the discovery time 
grows exponentially with sequence length, L. 
doi:10.1371/journal.pcbi.1003818.g003 

Figure 4. Regeneration process. Gene duplication (or possibly some 
other process) generates a steady stream of starting sequences that are 
a constant number k of mutations away from the target. Many searches 
drift away from the target, but some will succeed in polynomially many 
steps. We prove that $L^{k+1}$ searches ensure that with high probability 
some search succeeds in polynomially many steps. 
doi:10.1371/journal.pcbi.1003818.g004 

Our theory has clear empirical implications. The regeneration 
process can be tested in systems of in vitro evolution [47]. A 
starting sequence can be generated by introducing k point 
mutations in a known protein encoding sequence of length L. If 
these point mutations destroy the function of the protein, then the 
expected discovery time of any one attempt to find the original 
sequence should be exponential in L. But only polynomially many 
searches in L are required to find the target with high probability 
in polynomially many steps. The same setup can be used to 
explore whether the biological function can be found elsewhere in 
sequence space: the evolutionary trajectory beginning with the 
starting sequence could discover new solutions. Our theory also 
highlights how important it is to explore the distribution of 
biological functions in sequence space both for RNA [20,21,35,46] 
and in the protein universe [48]. 

In summary, we have developed a theory that allows us to 
estimate time scales of evolutionary trajectories. We have shown 
that various natural processes of evolution take exponential time as a 
function of the sequence length, L. In some cases we have 
established strong dichotomy results for precise boundary condi- 
tions. We have proposed a mechanism that allows evolution in 
polynomial time scales. Some interesting directions of future work 
are as follows: (1) Consider various forms of rugged fitness 
landscapes and study more refined approximations as compared to 
the ones we consider; and then estimate the expected discovery 
time for the refined approximations. (2) While in this paper we 
characterize the difference between exponential and polynomial 
for the expected discovery time, more refined analysis (such as 
efficiency for polynomial time, like cubic vs quadratic time) for 
specific fitness landscapes using mechanisms like recombination is 
another interesting problem. 

Materials and Methods 

Our results are based on a mathematical analysis of the 
underlying stochastic processes. For Markov chains on the one- 
dimensional grid, we describe recurrence relations for the 
expected hitting time and present lower and upper bounds on 
the expected hitting time using combinatorial analysis (see Text 
S1 for details). We now present the basic intuitive arguments of 
the main results. 

Markov chain on the one-dimensional grid 

For a single broad peak, due to symmetry we can interpret the 
evolutionary random walk as a Markov chain on the one- 
dimensional grid. A sequence of type $i$ is $i$ steps away from the 
target, where $i$ is the Hamming distance between this sequence and 
the target. The probability that a type $i$ sequence mutates to a type 
$i-1$ sequence is given by $ui/(3L)$. The stochastic process of the 
evolutionary random walk is a Markov chain on the one-dimensional 
grid $0, 1, \ldots, L$. 

The basic recurrence relation 

Consider a Markov chain on the one-dimensional grid, and let 
H(j,i) denote the expected hitting time from i to j. The general 
recurrence relation for the expected hitting time is as follows: 

$$H(j,i) = 1 + P_{i,i+1}\,H(j,i+1) + P_{i,i-1}\,H(j,i-1) + P_{i,i}\,H(j,i) \qquad (1)$$

for $j < i < L$, with boundary condition $H(j,j) = 0$. The interpretation 
is as follows. Given the current state $i$, if $i \neq j$, at least one 
transition will be made to a neighboring state $i'$, with probability 
$P_{i,i'}$, from which the hitting time is $H(j,i')$. 

Intuition behind Theorem 1 

Theorem 1 is derived by obtaining precise bounds for the 
recurrence relation of the hitting time (Equation 1). Consider that 
$P_{k,k-1} > 0$ for all $j < k \leq i$ (i.e., progress towards state $j$ is always 
possible), as otherwise $j$ is never reached from $i$. We show (see Lemma 
2 in Text S1) that we can write $H(j,i)$ as a sum, 
$H(j,i) = \sum_{n=L-i}^{L-j-1} b_n$, where $b_n$ is the sequence defined as: 

$$\text{(i)}\ \ b_0 = \frac{1}{P_{L,L-1}}; \qquad \text{(ii)}\ \ b_n = \frac{1 + P_{L-n,L-n+1}\, b_{n-1}}{P_{L-n,L-n-1}} \ \ \text{for } n > 0. \qquad (2)$$

The basic intuition obtained from Equation 2 is as follows: (i) If 
$\frac{P_{L-n,L-n+1}}{P_{L-n,L-n-1}} > \lambda$, for some constant $\lambda > 1$, then the sequence $b_n$ 
grows at least as fast as a geometric series with factor $\lambda$. (ii) On the 
other hand, if $\frac{P_{L-n,L-n+1}}{P_{L-n,L-n-1}} < 1$ and $P_{L-n,L-n-1} > a$ for some 
constant $a > 0$, then the sequence $b_n$ grows at most as fast as an 
arithmetic series with difference $1/a$. From the above case analysis 
the result for Theorem 1 is obtained as follows: If $c < 3/4$, then for all 
$cL < n < \frac{3+4c}{8}L$, we have $\frac{P_{L-n,L-n+1}}{P_{L-n,L-n-1}} > \lambda$ for some $\lambda > 1$, and 
hence the sequence $b_n$ grows geometrically for a linear length in L. 
Then, $H(cL,i) > \lambda^{\frac{3-4c}{8}L}$ for all states $i > cL$ (i.e., for all sequences 
outside of the target set). This corresponds to case 1 of Theorem 1. 
On the other hand, if $c \geq 3/4$, then it is $\frac{P_{L-n,L-n+1}}{P_{L-n,L-n-1}} < 1$, and case 2 of 
Theorem 1 is derived (for details see Corollary 2 in Text S1). 
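
The recurrence for $b_n$ can be evaluated numerically to see the two regimes. The sketch below is a minimal illustration under stated assumptions: a flat landscape outside the peak, a start at the maximal Hamming distance L, downward probability $P_{k,k-1} = uk/(3L)$ as given above, and upward probability $P_{k,k+1} = u(L-k)/L$, which is inferred here from the per-neighbor probability u/(3L) and the 3(L-k) neighbors that lie one step farther away. The function name and the choice u = 1 are illustrative.

```python
def hitting_time_from_far_end(L, j, u=1.0):
    """Expected number of generations to first reach Hamming distance j when
    starting at distance L, computed as the sum of b_0, ..., b_{L-j-1}."""

    def down(k):  # P_{k,k-1}: one of the k mismatches is corrected
        return u * k / (3 * L)

    def up(k):    # P_{k,k+1}: one of the L-k matches is destroyed (assumption)
        return u * (L - k) / L

    total = 0.0
    b = 0.0  # placeholder for b_{-1}; it is multiplied by up(L) = 0
    for n in range(L - j):
        k = L - n
        b = (1.0 + up(k) * b) / down(k)  # b_n: expected time from k to k-1
        total += b
    return total


if __name__ == "__main__":
    for L in (40, 80, 160):
        half = hitting_time_from_far_end(L, L // 2)        # c = 1/2
        wide = hitting_time_from_far_end(L, 3 * L // 4)    # c = 3/4
        print(L, f"{half:.3e}", f"{wide:.3e}")
```

Doubling L multiplies the hitting time for c = 1/2 by orders of magnitude, whereas for c = 3/4 it grows only by a small factor, in line with the dichotomy of Theorem 1.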



Intuition behind Theorem 2 

The basic intuition for the result is as follows: consider a single 
search for which the expected hitting time is exponential. Then for 
the single search the probability to succeed in polynomially many 
steps is negligible (as otherwise the expectation would not have 
been exponential). In case of independent searches, the indepen- 
dence ensures that the probability that all searches fail is the 
product of the probabilities that every single search fails. Using the 
above arguments we establish Theorem 2 (for details see Section 8 
in Text S1). 

Intuition behind Theorem 3 

For this result, it is first convenient to view the evolutionary walk 
taking place in the sequence space of all sequences of length L, under 
no selection. Each sequence has $3L$ neighbors, and considering that a 
point mutation happens, the transition probability to each of them is 
$\frac{1}{3L}$. The underlying Markov chain due to symmetry has fast mixing 
time, i.e., the number of steps to converge to the stationary 
distribution (the mixing time) is $O(L \log L)$. Again by symmetry 
the stationary distribution is the uniform distribution. If $c < 3/4$, then 
from Theorem 1 we obtain that the expected time to reach a single 
broad peak is exponential. By union bound, if $m \ll 4^L$, the 
probability to reach any of the m broad peaks within $O(L \log L)$ steps 
is negligible. Since after the first $O(L \log L)$ steps the Markov chain 
converges to the stationary distribution, each step of the process 
can be interpreted as selection of sequences uniformly at random 
among all sequences. Using Hoeffding's inequality, we show that with 
high probability, in expectation $\frac{\exp(2(3/4-c)^2 L)}{m}$ such steps are 
required before a sequence is found that belongs to the target set. 
Thus we obtain the result of Theorem 3 (for details see Section 9 in 
Text S1). 
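
The role of Hoeffding's inequality in this argument can be checked numerically. The sketch below compares the exact fraction of sequence space lying within Hamming distance cL of one center with the bound $\exp[-2L(3/4-c)^2]$, using the fact that the distance of a uniformly random sequence from a fixed center is binomial with mean 3L/4. The function names and the value L = 60 are illustrative choices.

```python
from math import comb, exp


def ball_fraction(L, radius):
    """Exact fraction of the 4^L sequences within Hamming distance `radius`
    of a fixed center."""
    ball = sum(comb(L, k) * 3**k for k in range(radius + 1))
    return ball / 4**L


def hoeffding_bound(L, radius):
    """Hoeffding upper bound exp(-2 L t^2) with t = 3/4 - radius/L,
    valid for radius <= 3L/4."""
    t = 0.75 - radius / L
    return exp(-2 * L * t * t)


if __name__ == "__main__":
    L = 60
    for radius in (0, 10, 20, 30, 40):
        print(radius, f"{ball_fraction(L, radius):.3e}",
              f"{hoeffding_bound(L, radius):.3e}")
```

Since a uniformly random sequence hits a single peak with probability at most the Hoeffding bound, a union bound over m peaks gives a per-step success probability of at most m times that bound, which is the origin of the factor 1/m in Theorem 3.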

Remark about techniques 

An important aspect of our work is that we establish our results 
using elementary techniques for analysis of Markov chains. More 
advanced mathematical machinery, such as 
martingales [49] or drift analysis [50,51], could possibly be used 
to derive more refined results. While in this work our goal is to 
distinguish between exponential and polynomial time, whether 
the techniques from [49-51] can lead to a more refined 
characterization within polynomial time is an interesting direc- 
tion for future work. 

Supporting Information 

Text S1. Detailed proofs for "The Time Scale of Evolutionary Innovation." (PDF)